Workshop FF UK 5.10.2023 3️⃣
Selected Chapters in Data Analysis (Vybrané kapitoly z analýzy dat)
LMU Munich
📫 renata.topinkova[at]lmu.de
🤔 Is there an API?
Isn’t there an R package for that?
📦 WHO, guardianapi, spotifyR, nytimes, wbstats, RedditExtractoR
Are you sure?
Google, GitHub
If you’re SURE sure… Generic package
📦 httr, httr2
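When no dedicated package exists, a request can be assembled by hand. The following is a minimal sketch with httr2; the base URL, the path, the query parameters, and the contact address in the user agent are all hypothetical placeholders, not a real API.

```r
library(httr2)

# Hypothetical endpoint and parameters, for illustration only
req <- request("https://api.example.org") |>
  req_url_path_append("v1", "articles") |>
  req_url_query(q = "scraping", page = 1) |>
  req_user_agent("workshop-demo (your.name@example.org)")

# Inspect the URL that would be requested, without sending anything
req$url

# To actually send the request and parse a JSON body (needs a real API):
# resp <- req_perform(req)
# data <- resp_body_json(resp)
```

Building the request first and performing it in a separate step makes it easy to check the URL before any traffic reaches the server.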
Web scraping = extracting and copying data from a web page into a structured format using a computer program
Often also referred to as "screen scraping" = scraping from a computer screen
✅ Works without API
✅ Flexible
✅ User’s behavior simulation (Selenium)
❌ Often complicated & frustrating (CAPTCHA, JavaScript)
❌ Needle in a haystack
❌ Data wrangling
❌ Websites can change
❌ Legally a gray area
Terms of Service, robots.txt, country laws, purpose, contacting a platform ➡️ a lot of uncertainty and a gap between theory & practice; if unsure, seek professional advice
If there is an API available and you don't need to simulate users' behavior ➡️ go for the API
robots.txt = Policy that specifies rules about automated data collection on the site
Can be accessed by typing websitename/robots.txt in your browser
E.g.: https://www.ted.com/robots.txt
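The same file can also be read from R. A minimal base-R sketch follows, assuming a simplified reading of the format: it only collects the Disallow rules for the `User-agent: *` group and ignores Allow lines and wildcards. The robots.txt content is a hypothetical example inlined so the code runs offline, and the helper names are made up for illustration.

```r
# Hypothetical robots.txt content; for a real site you could use e.g.
# robots_lines <- readLines("https://www.ted.com/robots.txt")
robots_lines <- c(
  "User-agent: *",
  "Disallow: /search",
  "Disallow: /private/",
  "",
  "User-agent: BadBot",
  "Disallow: /"
)

# Collect the Disallow rules in the group that applies to all agents ("*")
disallowed_for_all <- function(lines) {
  current_agent <- NA_character_
  rules <- character()
  for (line in trimws(lines)) {
    if (grepl("^user-agent:", line, ignore.case = TRUE)) {
      current_agent <- trimws(sub("^[^:]*:", "", line))
    } else if (grepl("^disallow:", line, ignore.case = TRUE) &&
               identical(current_agent, "*")) {
      rules <- c(rules, trimws(sub("^[^:]*:", "", line)))
    }
  }
  rules
}

rules <- disallowed_for_all(robots_lines)
rules

# Simplification: a path counts as blocked if it starts with any rule
is_blocked <- function(path, rules) {
  any(startsWith(path, rules[nzchar(rules)]))
}

is_blocked("/search?q=r", rules)
is_blocked("/talks", rules)
```

For anything beyond a quick check, a dedicated parser such as the robotstxt package is likely the safer choice, since real files use Allow lines and wildcards that this sketch skips.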
Example: https://www.researchgate.net/terms-of-service
In connection with using or accessing the Service, you shall not:
Impose an unreasonable or disproportionately large administrative burden on ResearchGate
Use any robot, spider, scraper, data mining tools, data gathering and extraction tools, or other automated means to access our Service for any purpose, except with the prior express permission of ResearchGate in writing
Employ any mechanisms, software, or scripts when using the Service
Inspect (CZ: Prozkoumat) (Ctrl + Shift + I)
💡 You can also use the SelectorGadget browser add-on to help you find relevant selectors
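Selectors found this way are passed to rvest as CSS or XPath. A small sketch of the most common selector types follows, using a hypothetical page fragment (the tags, classes, and names are invented for illustration) so that it runs without a network connection.

```r
library(rvest)

# Hypothetical page fragment, inlined so the example runs offline
page <- minimal_html('
  <div id="main">
    <h1 class="title">Talks</h1>
    <p class="speaker">Ada</p>
    <p class="speaker">Grace</p>
    <p class="note">Updated weekly</p>
  </div>
')

# CSS selectors: tag name, .class, #id
by_tag   <- html_elements(page, css = "p")
by_class <- html_elements(page, css = ".speaker")
by_id    <- html_elements(page, css = "#main")

# The same class lookup written as an XPath expression
by_xpath <- html_elements(page, xpath = "//p[@class='speaker']")

length(by_tag)
html_text(by_class)
html_text(by_xpath)
```

CSS is usually shorter to write; XPath is handy when the selection depends on something CSS cannot express, such as the text of a node.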
read_html("path") - reads in the entire web page (STEP 1)
html_elements(x, css = "") or html_elements(x, xpath = "") finds all elements matching a CSS selector or an XPath expression
html_text() extracts text between tags
html_attr() extracts the value of an attribute (mostly href from <a href = "">, i.e., links on pages)
html_table() reads in tables
Beware of html_element(): same function as html_elements(), but it returns only the first matching element!
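The functions above can be chained into a short workflow. This is a sketch on a hypothetical inline page (the links, class names, and table values are made up); for a real site, a URL or a local file path would go into read_html() instead.

```r
library(rvest)

# STEP 1: read the page (hypothetical inline HTML instead of a URL or path)
page <- read_html('
  <html><body>
    <a class="talk" href="/talks/1">First talk</a>
    <a class="talk" href="/talks/2">Second talk</a>
    <table>
      <tr><th>speaker</th><th>views</th></tr>
      <tr><td>Ada</td><td>10</td></tr>
      <tr><td>Grace</td><td>20</td></tr>
    </table>
  </body></html>
')

# STEP 2: find the elements of interest
links <- html_elements(page, css = "a.talk")

# STEP 3: extract text and attributes from them
titles <- html_text(links)
urls   <- html_attr(links, "href")

# html_table() on a whole page returns a list of tables; take the first
views <- html_table(page)[[1]]

# html_element() (singular) keeps only the first match
first_link <- html_element(page, css = "a.talk")
html_text(first_link)
```

Here `links` holds both anchors, while `first_link` holds only the first one, which is the difference the warning above refers to.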
Open the 03_1_Screenscraping_intro_exercise.qmd
Open the intro.html file.
Webscraping in R 2023 - Renata Topinkova